Automatic Story Segmentation for Spoken Document Retrieval

Authors

  • Pui-Yu Hui
  • Xiaoou Tang
  • Helen M. Meng
  • Wai Lam
  • Xinbo Gao
Abstract

We have been working on speech retrieval based on Cantonese television news programs. Our video archive contains over 20 hours of news programs provided by a local television station. These programs have been hand-segmented into video clips, where each clip is a self-contained news story. The audio tracks in our archive are indexed by Cantonese speech recognition, which is integrated with a vector-space information retrieval model to achieve speech retrieval. This paper proposes an approach for automatic story segmentation of television news programs, intended to replace the hand-segmentation described above. Automatic story segmentation is critical for rapid expansion of our video archive. Our approach relies on the assumption that nearly all news stories follow the temporal syntax (begin_story, anchor shots, field shots, end_story). Our algorithm therefore aims to detect field-to-anchor shot boundaries, which should also coincide with the story boundaries. The proposed approach uses video frame information for story boundary detection and involves techniques such as fuzzy c-means and graph-theoretical clustering. The approach achieved precision and recall values of over 70% on a 20-hour video corpus.
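The temporal-syntax assumption above can be illustrated with a minimal sketch. It assumes that each shot has already been labelled "anchor" or "field" (in the paper this labelling comes from clustering video frames, which is not reproduced here), and the function names, labels, and reference boundaries are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def story_boundaries(shot_labels: List[str]) -> List[int]:
    """Return the indices of shots that begin a new story.

    Assumption: a story is a run of anchor shots followed by field shots,
    so a new story begins at every field-to-anchor transition. The first
    shot of the program is also treated as a story start.
    """
    boundaries = [0] if shot_labels else []
    for i in range(1, len(shot_labels)):
        if shot_labels[i - 1] == "field" and shot_labels[i] == "anchor":
            boundaries.append(i)
    return boundaries


def precision_recall(detected: List[int], reference: List[int]) -> Tuple[float, float]:
    """Score detected boundaries against hand-marked ones by exact match."""
    hits = len(set(detected) & set(reference))
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical shot sequence and hand-segmented reference boundaries.
labels = ["anchor", "field", "field", "anchor", "anchor", "field", "anchor", "field"]
detected = story_boundaries(labels)
precision, recall = precision_recall(detected, [0, 3, 5, 6])
```

In this toy sequence the reference contains a story that starts at shot 5 without an anchor shot, so it is missed. This is the kind of case that the "nearly all" in the assumption allows for, and it lowers recall while leaving precision unaffected.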

Similar resources

Mandarin Chinese Broadcast News Retrieval and Summarization Using Probabilistic Generative Models

This paper presents our recent research work on applying probabilistic generative models to Mandarin Chinese broadcast news retrieval and summarization. Most models can be trained in either a supervised or unsupervised manner. In addition, both literal term matching and concept matching strategies have been intensively investigated. This paper also presents a prototype web-based Mandarin Chines...

Initial Experiments on Automatic Story Segmentation in Chinese Spoken Documents Using Lexical Cohesion of Extracted Named Entities

Story segmentation plays a critical role in spoken document processing. Spoken documents often come in a continuous audio stream without explicit boundaries related to stories or topics. It is important to be able to automatically segment these audio streams into coherent units. This work is an initial attempt to make use of informative lexical terms (or key terms) in recognition transcripts of...

Recognition, indexing and retrieval of British broadcast news with the THISL system

This paper describes the THISL spoken document retrieval system for British and North American Broadcast News. The system is based on the ABBOT large vocabulary speech recognizer and a probabilistic text retrieval system. We discuss the development of a realtime British English Broadcast News system, and its integration into a spoken document retrieval system. Detailed evaluation is performed u...

Generating Phonetic Cognates to Handle Named Entities in English-Chinese Cross-Language Spoken Document Retrieval

We have developed a technique for automatic transliteration of named entities for English-Chinese cross-language spoken document retrieval (CL-SDR). Our retrieval system integrates machine translation, speech recognition and information retrieval technologies. An English news story forms a textual query that is automatically translated into Chinese words, which are mapped into Mandarin syllable...

Spoken Document Retrieval for TREC-8 at Cambridge University

This paper presents work done at Cambridge University on the TREC-8 Spoken Document Retrieval (SDR) Track. The 500 hours of broadcast news audio was filtered using an automatic scheme for detecting commercials, and then transcribed using a 2-pass HTK speech recogniser which ran at 13 times real time. The system gave an overall word error rate of 20.5% on the 10 hour scored subset of the corpus,...

Publication date: 2001